Binary Classification Models Overview

| Model | Core Idea | Strengths | Limitations |
|---|---|---|---|
| Decision Tree | Splits data using thresholds on features. | Interpretable; handles nonlinearity. | Prone to overfitting; unstable under small data changes. |
| Support Vector Machine (SVM) | Finds the optimal hyperplane maximizing the margin. | Effective in high dimensions. | Sensitive to kernel choice; slow on large data. |
| Neural Network (MLP) | Learns hierarchical nonlinear relationships. | Handles complex patterns; flexible. | Opaque ("black box"); computationally heavy. |

How Axis-Parallel Decision Trees Split
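An axis-parallel split compares a single feature against a threshold and is scored by how much it reduces label entropy. A minimal base R sketch of the information-gain criterion (the same `info_gain` idea used by our tree, though the internals of `generate_tree` may differ):

```r
# Entropy of a class label vector
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Information gain of splitting y by feature x at a threshold:
# parent entropy minus the weighted entropy of the two children
info_gain <- function(x, y, threshold) {
  left  <- y[x <= threshold]
  right <- y[x >  threshold]
  w_l <- length(left) / length(y)
  entropy(y) - w_l * entropy(left) - (1 - w_l) * entropy(right)
}

# Toy data: the threshold x <= 2 separates the classes perfectly
x <- c(1, 2, 3, 4)
y <- c("a", "a", "b", "b")
info_gain(x, y, 2)  # 1 bit: entropy drops from 1 to 0
```

The tree grower evaluates this gain for every candidate feature/threshold pair and keeps the best one.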

Let’s Increase the Tree Depth

Custom Implementation of Decision Trees

| Model | Difference | Advantages | Limitations |
|---|---|---|---|
| Custom Tree | Makes pruning decisions at each node during tree growth. | Memory efficient. | Greedy approach. |
| rpart | Grows the full tree first, then prunes back. | Considers all pruning possibilities. | Higher memory usage. |
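The in-growth pruning of the custom tree could be sketched as follows; treating `alpha` as a minimum-gain threshold is an assumption for illustration, not the exact rule inside `generate_tree`:

```r
# Hypothetical in-growth pruning rule: accept a split only if the
# impurity reduction (gain) exceeds a complexity penalty alpha.
# This is a sketch, not the exact rule used by generate_tree().
should_split <- function(gain, n_node, max_depth_reached, alpha = 0.0128) {
  !max_depth_reached && n_node > 1 && gain > alpha
}

should_split(gain = 0.25, n_node = 80, max_depth_reached = FALSE)  # TRUE
should_split(gain = 0.01, n_node = 80, max_depth_reached = FALSE)  # FALSE
```

Because the decision is made node by node during growth, subtrees that would never survive pruning are simply never built, which is where the memory saving comes from.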
Code
base_r_tree <- generate_tree(y_train, 
                             features = x_train, 
                             criteria_type = "info_gain", 
                             #actual depth + 1 
                             max_depth = 4, 
                             alpha = 0.0128
)
print_tree(base_r_tree)
 Petal.Length <= 4.85 (n = 80 )
  Left:
   Predict: versicolor (n = 41 )
  Right:
   Petal.Length <= 4.95 (n = 39 )
    Left:
     Sepal.Length <= 6.5 (n = 3 )
      Left:
       Predict: virginica (n = 2 )
      Right:
       Predict: versicolor (n = 1 )
    Right:
     Predict: virginica (n = 36 )

In-sample error: 5 %
Out-of-sample error: 10 %
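Both figures are plain misclassification rates, computed on the training set and on held-out data respectively; as a one-line helper (`error_rate` is our own name, not a function from the project):

```r
# Misclassification rate in percent, given predicted and true labels
error_rate <- function(pred, truth) mean(pred != truth) * 100

error_rate(c("a", "a", "b"), c("a", "b", "b"))  # one of three wrong: 33.33 %
```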

Comparison with rpart Tree

Oblique Decision Split with SVM
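Instead of thresholding one feature, an oblique split routes observations by the sign of a linear SVM decision value w·x + b, so the boundary can be tilted in feature space. A minimal sketch with made-up weights:

```r
# Oblique split: route by the sign of a linear decision function
# f(x) = w . x + b, as a linear SVM would (w and b here are made up).
oblique_route <- function(X, w, b) {
  decision <- as.vector(X %*% w + b)
  ifelse(decision > 0, "left", "right")
}

X <- rbind(c(4.4, 6.7),   # Petal.Length, Sepal.Length
           c(6.0, 5.0))
w <- c(-1.2, 0.9)         # hypothetical SVM weights
b <- 2.0
oblique_route(X, w, b)    # first row goes left, second goes right
```

An axis-parallel split is the special case where w has a single nonzero entry.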

Our SVMODT Implementation in Context

| Feature | Our SVMODT Approach | Literature Context | Practical Advantage |
|---|---|---|---|
| Tree construction | Recursive linear SVM at each node. | Similar to Nie (2019) DTSVM. | Fast training; scalable to large datasets. |
| Split criterion | SVM decision values determine splits. | Standard approach across all methods. | Mathematically principled; maximizes margin. |
| Scaling | Node-specific scaling at each split. | Novel contribution; not in the literature. | Prevents feature-scale issues; more robust. |
| Class weights | Balanced or custom weights per node. | Similar to Bala & Agrawal (2010). | Handles imbalanced data effectively. |
| Feature selection | Random / mutual information / correlation, with penalties. | Enhanced with penalties (novel). | Promotes feature diversity; reduces overfitting. |
| Hyperparameters | depth, min_samples, max_features. | Fewer parameters than kernel methods. | Easy to tune; less prone to overfitting. |

SVMODT Split with Max-Depth 2

Code
tree <- svm_split(
 data = train_data,
 response = "Species", 
 cost = 2, 
 max_depth = 2,
 class_weights = "balanced"
)

How SVMODT Classifies Observations

Code
print_svm_tree(tree, show_penalties = FALSE)
 🌳 Node: depth = 1 | n = 80 | features = [Petal.Length,Sepal.Length] | max_feat = 2
 ├─ Left branch (SVM > 0):
│   🌳 Node: depth = 2 | n = 40 | features = [Petal.Length,Sepal.Length] | max_feat = 2
│   ├─ Left branch (SVM > 0):
│  │   🍃 Leaf: predict = versicolor | n = 33
│   └─ Right branch (SVM ≤ 0):
│      🍃 Leaf: predict = versicolor | n = 7
 └─ Right branch (SVM ≤ 0):
    🌳 Node: depth = 2 | n = 40 | features = [Petal.Length,Sepal.Length] | max_feat = 2
    ├─ Left branch (SVM > 0):
   │   🍃 Leaf: predict = virginica | n = 5
    └─ Right branch (SVM ≤ 0):
       🍃 Leaf: predict = virginica | n = 35

Tracing Prediction from SVMODT

Code
trace_prediction_path(tree = tree, sample_idx = 4, sample_data = test_data)
=== Tracing Prediction Path ===
Sample 4 :
   Species = 1 
   Petal.Length = 4.4 
   Sepal.Length = 6.7 

 🌳 Node 1 : features = Petal.Length,Sepal.Length 
   SVM decision value: 2.7364 
   → Going LEFT (decision > 0)
   🌳 Node 2 : features = Petal.Length,Sepal.Length 
     SVM decision value: 2.921 
     → Going LEFT (decision > 0)
     🍃 FINAL: Predict versicolor (n = 33 )
     Path taken: LEFT → LEFT 

Final prediction: versicolor 
[1] "versicolor"

Wisconsin Breast Cancer Data

Response: Diagnosis (M = malignant, B = benign)

Features:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter² / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension (“coastline approximation” - 1)
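As a quick sanity check on the compactness definition in item f, a perfect circle attains the minimum value 4π - 1:

```r
# Compactness as defined above: perimeter^2 / area - 1.0.
# For a circle, (2*pi*r)^2 / (pi*r^2) = 4*pi regardless of r.
compactness <- function(perimeter, area) perimeter^2 / area - 1.0

r <- 1
compactness(2 * pi * r, pi * r^2)  # 4*pi - 1, about 11.57
```

Less circular (more irregular) contours score higher, which is why compactness is informative for malignancy.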

How well does a Decision Tree Perform on WDBC Data?

Fitting SVMODT on WDBC

Code
tree <- svm_split(
 data = train_data,
 response = "diagnosis", 
 cost = 2, 
 max_depth = 2,
 max_features = 2,
 feature_method = "mutual", # other options: random (default), cor
 class_weights = "balanced"
 )
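With `feature_method = "mutual"`, candidate features are ranked by mutual information with the response. A base R sketch of the score on a pre-binned feature (the exact estimator inside `svm_split` is an implementation detail and may differ):

```r
# Mutual information between a discretized feature and a label:
# I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mutual_info <- function(x_binned, y) {
  joint <- table(x_binned, y) / length(y)
  px <- rowSums(joint); py <- colSums(joint)
  mi <- 0
  for (i in seq_along(px)) for (j in seq_along(py))
    if (joint[i, j] > 0)
      mi <- mi + joint[i, j] * log2(joint[i, j] / (px[i] * py[j]))
  mi
}

x <- c("lo", "lo", "hi", "hi")
y <- c("B",  "B",  "M",  "M")
mutual_info(x, y)  # 1 bit: the binned feature determines the label
```

A feature that is independent of the label scores 0, so the ranking naturally discards uninformative predictors.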

Penalizing Used Features

Code
tree <- svm_split(
 data = train_data,
 response = "diagnosis", 
 cost = 2, 
 max_depth = 2,
 max_features = 2,
 feature_method = "mutual", # other options: random (default), cor
 class_weights = "none",
 penalize_used_features = TRUE
 )
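The idea behind `penalize_used_features` is to down-weight the selection scores of features already used higher in the tree, so child nodes prefer fresh ones. A hypothetical sketch (the 0.5 factor and the helper name are made up for illustration):

```r
# Hypothetical penalty: multiply the selection score of any feature
# already used by an ancestor node by a constant factor < 1.
penalize_scores <- function(scores, used, factor = 0.5) {
  hit <- names(scores) %in% used
  scores[hit] <- scores[hit] * factor
  scores
}

scores <- c(radius_mean = 0.8, texture_mean = 0.6, area_se = 0.5)
penalize_scores(scores, used = c("radius_mean"))
# radius_mean drops to 0.4, so texture_mean now ranks first
```

This is what promotes the feature diversity visible in the printed trees: child nodes tend to pick different feature subsets than their parents.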

Adding More Features

In-sample error: 3.08 %

Out-of-sample error: 2.61 %

Code
tree <- svm_split(
 data = train_data,
 response = "diagnosis",
 max_depth = 3,
 max_features = 5,
 feature_method = "mutual",
 class_weights = "balanced",
 penalize_used_features = TRUE
)

print_svm_tree(tree)
 🌳 Node: depth = 1 | n = 454 | features = [perimeter_worst,area_worst,radius_worst,concave.points_worst,concave.points_mean] | max_feat = 5 | penalty = ✓
 ├─ Left branch (SVM > 0):
│   🌳 Node: depth = 2 | n = 273 | features = [radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean] | max_feat = 5 | penalty = ⚠️
│   ├─ Left branch (SVM > 0):
│  │   🌳 Node: depth = 3 | n = 257 | features = [radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean] | max_feat = 5 | penalty = ⚠️
│  │   ├─ Left branch (SVM > 0):
│  │  │   🍃 Leaf: predict = B | n = 222
│  │   └─ Right branch (SVM ≤ 0):
│  │      🍃 Leaf: predict = B | n = 35
│   └─ Right branch (SVM ≤ 0):
│      🌳 Node: depth = 3 | n = 16 | features = [area_se,fractal_dimension_worst,radius_mean,area_mean,texture_mean]
│      ├─ Left branch (SVM > 0):
│     │   🍃 Leaf: predict = B | n = 14
│      └─ Right branch (SVM ≤ 0):
│         (no right child)
 └─ Right branch (SVM ≤ 0):
    🌳 Node: depth = 2 | n = 181 | features = [texture_mean,texture_worst,area_se,concavity_worst,radius_se] | max_feat = 5 | penalty = ⚠️
    ├─ Left branch (SVM > 0):
   │   🌳 Node: depth = 3 | n = 28 | features = [texture_worst,perimeter_worst,radius_mean,texture_mean,perimeter_mean] | max_feat = 5 | penalty = ⚠️
   │   ├─ Left branch (SVM > 0):
   │  │   🍃 Leaf: predict = B | n = 16
   │   └─ Right branch (SVM ≤ 0):
   │      🍃 Leaf: predict = M | n = 12
    └─ Right branch (SVM ≤ 0):
       🌳 Node: depth = 3 | n = 153 | features = [radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean] | max_feat = 5 | penalty = ⚠️
       ├─ Left branch (SVM > 0):
      │   🍃 Leaf: predict = M | n = 13
       └─ Right branch (SVM ≤ 0):
          🍃 Leaf: predict = M | n = 140

Manipulating Features at Child Nodes

In-sample error: 3.3 %

Out-of-sample error: 2.61 %

Code
tree <- svm_split(
 data = train_data,
 response = "diagnosis",
 max_depth = 3,
 max_features = 5,
 # other options: random, constant (default)
 max_features_strategy = "decrease", 
 max_features_decrease_rate = 0.8, 
 feature_method = "mutual",
 class_weights = "balanced"
)

print_svm_tree(tree)
 🌳 Node: depth = 1 | n = 454 | features = [perimeter_worst,area_worst,radius_worst,concave.points_worst,concave.points_mean] | max_feat = 5 | penalty = ✓
 ├─ Left branch (SVM > 0):
│   🌳 Node: depth = 2 | n = 273 | features = [radius_mean,texture_mean,perimeter_mean,area_mean] | max_feat = 4 | penalty = ✓
│   ├─ Left branch (SVM > 0):
│  │   🌳 Node: depth = 3 | n = 251 | features = [radius_mean,texture_mean,perimeter_mean] | max_feat = 3 | penalty = ✓
│  │   ├─ Left branch (SVM > 0):
│  │  │   🍃 Leaf: predict = B | n = 220
│  │   └─ Right branch (SVM ≤ 0):
│  │      🍃 Leaf: predict = B | n = 31
│   └─ Right branch (SVM ≤ 0):
│      🌳 Node: depth = 3 | n = 22 | features = [radius_mean,area_mean,area_se]
│      ├─ Left branch (SVM > 0):
│     │   🍃 Leaf: predict = B | n = 20
│      └─ Right branch (SVM ≤ 0):
│         (no right child)
 └─ Right branch (SVM ≤ 0):
    🌳 Node: depth = 2 | n = 181 | features = [texture_mean,texture_worst,perimeter_worst,concave.points_worst] | max_feat = 4 | penalty = ✓
    ├─ Left branch (SVM > 0):
   │   🌳 Node: depth = 3 | n = 34 | features = [texture_worst,texture_mean,radius_mean] | max_feat = 3 | penalty = ✓
   │   ├─ Left branch (SVM > 0):
   │  │   🍃 Leaf: predict = B | n = 17
   │   └─ Right branch (SVM ≤ 0):
   │      🍃 Leaf: predict = M | n = 17
    └─ Right branch (SVM ≤ 0):
       🍃 Leaf: predict = M | n = 147
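The printed `max_feat` values (5 at depth 1, 4 at depth 2, 3 at depth 3) are consistent with a geometric schedule applied per depth level; a hypothetical reconstruction of `max_features_strategy = "decrease"`, assuming the budget is floored:

```r
# Hypothetical reconstruction of the "decrease" strategy: shrink the
# feature budget geometrically with depth and floor it, never below 1.
max_feat_at_depth <- function(max_features, rate, depth) {
  max(1, floor(max_features * rate^(depth - 1)))
}

sapply(1:3, max_feat_at_depth, max_features = 5, rate = 0.8)
# 5 4 3, matching the max_feat values printed above
```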

Adding Random Features

In-sample error: 2.86 %

Out-of-sample error: 0.87 %

Code
set.seed(123)
tree <- svm_split(
 data = train_data,
 response = "diagnosis",
 max_depth = 3,
 feature_method = "mutual",
 max_features_strategy = "random",
 max_features_random_range = c(0.1,0.3),
 class_weights = "balanced_subsample",
 penalize_used_features = TRUE
)

print_svm_tree(tree)
 🌳 Node: depth = 1 | n = 454 | features = [perimeter_worst,area_worst,radius_worst,concave.points_worst,concave.points_mean,perimeter_mean,area_mean,area_se,radius_mean] | max_feat = 9 | penalty = ✓
 ├─ Left branch (SVM > 0):
│   🌳 Node: depth = 2 | n = 278 | features = [radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean] | max_feat = 7 | penalty = ⚠️
│   ├─ Left branch (SVM > 0):
│  │   🌳 Node: depth = 3 | n = 220 | features = [radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean,symmetry_mean] | max_feat = 9 | penalty = ⚠️
│  │   ├─ Left branch (SVM > 0):
│  │  │   🍃 Leaf: predict = B | n = 200
│  │   └─ Right branch (SVM ≤ 0):
│  │      🍃 Leaf: predict = B | n = 20
│   └─ Right branch (SVM ≤ 0):
│      🌳 Node: depth = 3 | n = 58 | features = [radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean] | max_feat = 8 | penalty = ⚠️
│      ├─ Left branch (SVM > 0):
│     │   🍃 Leaf: predict = B | n = 43
│      └─ Right branch (SVM ≤ 0):
│         🍃 Leaf: predict = B | n = 15
 └─ Right branch (SVM ≤ 0):
    🌳 Node: depth = 2 | n = 176 | features = [texture_mean,texture_worst,smoothness_worst,concavity_worst] | max_feat = 4 | penalty = ⚠️
    ├─ Left branch (SVM > 0):
   │   🌳 Node: depth = 3 | n = 33 | features = [radius_se,perimeter_worst,concavity_mean,radius_worst] | max_feat = 4 | penalty = ⚠️
   │   ├─ Left branch (SVM > 0):
   │  │   🍃 Leaf: predict = B | n = 14
   │   └─ Right branch (SVM ≤ 0):
   │      🍃 Leaf: predict = M | n = 19
    └─ Right branch (SVM ≤ 0):
       🌳 Node: depth = 3 | n = 143 | features = [radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave.points_mean,symmetry_mean] | max_feat = 9 | penalty = ⚠️
       ├─ Left branch (SVM > 0):
      │   🍃 Leaf: predict = M | n = 17
       └─ Right branch (SVM ≤ 0):
          🍃 Leaf: predict = M | n = 126
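A plausible reading of `max_features_strategy = "random"` with `max_features_random_range = c(0.1, 0.3)`: each node draws its feature budget uniformly from that fraction range of the total predictor count (WDBC has 30 numeric predictors, so 3 to 9 features, matching the printed `max_feat` values of 4 to 9). A hedged sketch:

```r
# Hypothetical sketch of the "random" strategy: draw each node's
# feature budget uniformly from a fraction range of the total count.
random_max_feat <- function(n_features, range = c(0.1, 0.3)) {
  lo <- max(1, floor(n_features * range[1]))
  hi <- max(lo, floor(n_features * range[2]))
  sample(lo:hi, 1)
}

set.seed(123)
replicate(5, random_max_feat(30))  # values between 3 and 9
```

Randomizing the budget per node injects extra diversity, much like the per-split feature subsampling in random forests.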

Model Comparison

Code
svmodt_tree <- svm_split(data = train_data,
                         response = "diagnosis",
                         max_depth = 4,
                         feature_method = "mutual", 
                         max_features = 29,
                         max_features_strategy = "decrease",
                         max_features_decrease_rate = 0.5,
                         class_weights = "none",
                         penalize_used_features = TRUE)

Bland-Altman Plot

SVMODT Performance on MIMIC-III Data

De-identified health-related data from more than 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.
Response: death of a patient entering the ICU

Data Cleaning/Feature Engineering

  • Calculated Charlson Comorbidity Index

  • Counted additional diagnoses per admission

  • Kept only numeric predictors for model training


| Model | SVMODT | RBF SVM | Linear SVM | Decision Tree |
|---|---|---|---|---|
| AUC | 0.790 | 0.748 | 0.713 | 0.767 |
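The AUCs above can be computed in base R via the rank-sum (Mann-Whitney) identity: AUC is the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The `auc` helper below is our own sketch, not a function from the project:

```r
# AUC via the rank-sum identity: sum the ranks of the positive-class
# scores and normalize by the number of positive/negative pairs.
auc <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.9, 0.8, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0)
auc(scores, labels)  # 5/6: five of the six pos/neg pairs are ordered correctly
```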

Current Limitations

1. Hyperparameter Complexity

  • Multiple tuning parameters (max_depth, min_samples, max_features, penalty_weight)

  • Feature selection strategy requires careful consideration

2. Binary Classification Focus

  • Currently designed for two-class problems

3. Interpretability Trade-off

  • SVM decision boundaries less intuitive than pure thresholds

  • Feature interactions harder to explain

4. Computational Considerations

  • Node-specific scaling adds overhead

  • Feature penalty calculations at each split

  • Memory requirements for storing multiple SVM models

Future Directions

  • Automated hyperparameter tuning (Bayesian/meta-learning)

  • Native multi-class node splits

  • SHAP/LIME-based explainability

  • Parallel & approximate SVM scalability

  • Ensemble variants with random feature subsets

Thank You!

Questions?